purrr and broom

There are two main ways of working in R:
Base R has been around for decades; it was made by statisticians and software programmers for statisticians. R is the evolution of S, a statistical programming language, and it still carries some S legacy (less with every update). It was not designed with the "new data science" in mind, since that field, even though it is built on statistics and algebra, is a pretty new one.
https://www.tidyverse.org/ is an opinionated collection of packages designed for data science. They all - except ggplot2 - follow the same grammar and allow for mixing and matching, working well together.
Some criticism:
Even with all the advantages of this bundle of packages, in my experience the tidyverse can be slow when working with big datasets, which is why I recommend it for exploration and for testing models. I'm sure that 98% of the time you will be safe; it is not like we are managing big datasets or big data every day.
In this workshop we tackle data the tidy way; some functional programming will allow us to manage and explore many models at once. This goes beyond the usual explore-then-model workflow: we use modelling itself as a form of data exploration.
You could argue that a model (almost by definition) stays close to your initial assumptions, which is why the exploration step exists. Here we are going to use modelling to learn about the data: not only predictions, but the behaviour of different groups of data within a dataset.
The outline of this exercise is to create a dataframe with one row per group, where all data associated with that group is stored in a column of dataframes. Then one can apply modelling techniques to each group and use the output to learn about each group.
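As a quick sketch of that structure (using the built-in mtcars dataset rather than our data, purely for illustration), nesting by a grouping column gives one row per group with the rest of the data tucked into a list column, and models can then be mapped over the groups:

```r
library(tidyr)
library(dplyr)
library(purrr)

# one row per value of `cyl`; the remaining columns are stored as a
# dataframe inside the `data` list column
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest()

# fit a model per group and pull out a per-group summary
slopes <- by_cyl %>%
  mutate(fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
         slope = map_dbl(fit, ~ coef(.x)[["wt"]]))
```

This is exactly the pattern we will apply to the Wikipedia data below, just with more groups and more interesting models.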
First, I’ll need to load the data, and reshape it so that it’s easier to work with.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(broom)
Depending on your computer's limitations, you may want to work with only a fraction of the observations. You are welcome to use every observation, or to read a limited number of rows with the n_max argument of the read_csv function.
set.seed(42)
# Training data
data <- read_csv(file = "../data/clean.csv") %>%
sample_frac(1)
## Parsed with column specification:
## cols(
## .default = col_double(),
## article = col_character(),
## language = col_character(),
## domain = col_character()
## )
## See spec(...) for full column specifications.
Taking a first glance at the data:
head(data)
## # A tibble: 6 x 806
## article language domain X2015.07.01 X2015.07.02 X2015.07.03 X2015.07.04
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 攻殼機動隊… zh wikip… 463 435 474 444
## 2 海尔·塞拉西… zh wikip… 75 39 37 62
## 3 Gaten_… en wikip… NA NA NA NA
## 4 Сифилис ru wikip… 730 784 695 611
## 5 Superm… en wikip… 9341 9844 9818 9124
## 6 Panzer… de wikip… 4 18 18 4
## # … with 799 more variables: X2015.07.05 <dbl>, X2015.07.06 <dbl>,
## # X2015.07.07 <dbl>, X2015.07.08 <dbl>, X2015.07.09 <dbl>,
## # X2015.07.10 <dbl>, X2015.07.11 <dbl>, X2015.07.12 <dbl>,
## # X2015.07.13 <dbl>, X2015.07.14 <dbl>, X2015.07.15 <dbl>,
## # X2015.07.16 <dbl>, X2015.07.17 <dbl>, X2015.07.18 <dbl>,
## # X2015.07.19 <dbl>, X2015.07.20 <dbl>, X2015.07.21 <dbl>,
## # X2015.07.22 <dbl>, X2015.07.23 <dbl>, X2015.07.24 <dbl>,
## # X2015.07.25 <dbl>, X2015.07.26 <dbl>, X2015.07.27 <dbl>,
## # X2015.07.28 <dbl>, X2015.07.29 <dbl>, X2015.07.30 <dbl>,
## # X2015.07.31 <dbl>, X2015.08.01 <dbl>, X2015.08.02 <dbl>,
## # X2015.08.03 <dbl>, X2015.08.04 <dbl>, X2015.08.05 <dbl>,
## # X2015.08.06 <dbl>, X2015.08.07 <dbl>, X2015.08.08 <dbl>,
## # X2015.08.09 <dbl>, X2015.08.10 <dbl>, X2015.08.11 <dbl>,
## # X2015.08.12 <dbl>, X2015.08.13 <dbl>, X2015.08.14 <dbl>,
## # X2015.08.15 <dbl>, X2015.08.16 <dbl>, X2015.08.17 <dbl>,
## # X2015.08.18 <dbl>, X2015.08.19 <dbl>, X2015.08.20 <dbl>,
## # X2015.08.21 <dbl>, X2015.08.22 <dbl>, X2015.08.23 <dbl>,
## # X2015.08.24 <dbl>, X2015.08.25 <dbl>, X2015.08.26 <dbl>,
## # X2015.08.27 <dbl>, X2015.08.28 <dbl>, X2015.08.29 <dbl>,
## # X2015.08.30 <dbl>, X2015.08.31 <dbl>, X2015.09.01 <dbl>,
## # X2015.09.02 <dbl>, X2015.09.03 <dbl>, X2015.09.04 <dbl>,
## # X2015.09.05 <dbl>, X2015.09.06 <dbl>, X2015.09.07 <dbl>,
## # X2015.09.08 <dbl>, X2015.09.09 <dbl>, X2015.09.10 <dbl>,
## # X2015.09.11 <dbl>, X2015.09.12 <dbl>, X2015.09.13 <dbl>,
## # X2015.09.14 <dbl>, X2015.09.15 <dbl>, X2015.09.16 <dbl>,
## # X2015.09.17 <dbl>, X2015.09.18 <dbl>, X2015.09.19 <dbl>,
## # X2015.09.20 <dbl>, X2015.09.21 <dbl>, X2015.09.22 <dbl>,
## # X2015.09.23 <dbl>, X2015.09.24 <dbl>, X2015.09.25 <dbl>,
## # X2015.09.26 <dbl>, X2015.09.27 <dbl>, X2015.09.28 <dbl>,
## # X2015.09.29 <dbl>, X2015.09.30 <dbl>, X2015.10.01 <dbl>,
## # X2015.10.02 <dbl>, X2015.10.03 <dbl>, X2015.10.04 <dbl>,
## # X2015.10.05 <dbl>, X2015.10.06 <dbl>, X2015.10.07 <dbl>,
## # X2015.10.08 <dbl>, X2015.10.09 <dbl>, X2015.10.10 <dbl>,
## # X2015.10.11 <dbl>, X2015.10.12 <dbl>, …
We have 3 clean columns: article, language, and domain.
The other 803 columns are dates holding the number of views. To make this tidy we need to gather those date columns into long format, which gives us:
# convert to long format.
data <- data %>%
gather(key = "date", value = "views", -article, -language, -domain)
tail(data)
## # A tibble: 6 x 5
## article language domain date views
## <chr> <chr> <chr> <chr> <dbl>
## 1 Topic:T2szzp4bocj97lno <NA> mediawiki X2017.09.10 NA
## 2 Help:CirrusSearch/ja <NA> mediawiki X2017.09.10 6
## 3 Перестройка ru wikipedia X2017.09.10 384
## 4 Австралия_на_«Евровидении» ru wikipedia X2017.09.10 10
## 5 Шведов,_Денис_Эдуардович ru wikipedia X2017.09.10 163
## 6 Елена_Стефановна ru wikipedia X2017.09.10 77
Now, our dataframe of 28,000 rows and ~800 columns has been reshaped into a dataframe of 22,484,000 rows (28,000 × 803) and 5 columns.
The next step is to make sure each of the columns are of the right data type.
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 22484000 obs. of 5 variables:
## $ article : chr "攻殼機動隊" "海尔·塞拉西一世" "Gaten_Matarazzo" "Сифилис" ...
## $ language: chr "zh" "zh" "en" "ru" ...
## $ domain : chr "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
## $ date : chr "X2015.07.01" "X2015.07.01" "X2015.07.01" "X2015.07.01" ...
## $ views : num 463 75 NA 730 9341 ...
date is a character, which doesn't look right. To use it as a date we first have to remove the leading X character; we will wait a bit before converting the remainder from character to a date class.
data <- data %>%
mutate(date = str_replace(pattern = "X", replacement = "", date))
head(data)
## # A tibble: 6 x 5
## article language domain date views
## <chr> <chr> <chr> <chr> <dbl>
## 1 攻殼機動隊 zh wikipedia 2015.07.01 463
## 2 海尔·塞拉西一世 zh wikipedia 2015.07.01 75
## 3 Gaten_Matarazzo en wikipedia 2015.07.01 NA
## 4 Сифилис ru wikipedia 2015.07.01 730
## 5 Superman en wikipedia 2015.07.01 9341
## 6 Panzerkampfwagen_IV de wikipedia 2015.07.01 4
Just so we're aware - how many values are missing? Later we will apply an ARIMA model, which does not handle missing values well.
7.0753291% of the view counts are missing. We can't be sure whether this means there were zero views for that page on that day, or whether the view count data is simply missing for that day.
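That figure was computed inline; a helper like the following (a sketch - on our data it would be called as missing_pct(data$views)) reproduces it:

```r
# percentage of missing values in a vector
missing_pct <- function(x) mean(is.na(x)) * 100

missing_pct(c(10, NA, 3, NA))  # 50
```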
Dataframes within dataframes - a dataframe for every page

Now, we transform the dataset into a dataframe such that every page + language + domain combination populates one row, and the corresponding data is stored in a list column. This is a three-liner with the tidyr::nest() function. A word of warning - creating list columns with the tidyr::nest() function is very slow if one of your columns is of type Date (see this GitHub issue for more details). As such, I will first nest the dataframes, and then convert the date column in each nested dataframe (initially a character) to a native date format.
# dataframes within dataframes
nested <- data %>%
group_by(article, language, domain) %>%
nest()
head(nested)
## # A tibble: 6 x 4
## article language domain data
## <chr> <chr> <chr> <list>
## 1 攻殼機動隊 zh wikipedia <tibble [803 × 2]>
## 2 海尔·塞拉西一世 zh wikipedia <tibble [803 × 2]>
## 3 Gaten_Matarazzo en wikipedia <tibble [803 × 2]>
## 4 Сифилис ru wikipedia <tibble [803 × 2]>
## 5 Superman en wikipedia <tibble [803 × 2]>
## 6 Panzerkampfwagen_IV de wikipedia <tibble [803 × 2]>
Let's take a look at the first observation:
nested %>%
head(1) %>%
select(data) %>%
.$data
## [[1]]
## # A tibble: 803 x 2
## date views
## <chr> <dbl>
## 1 2015.07.01 463
## 2 2015.07.02 435
## 3 2015.07.03 474
## 4 2015.07.04 444
## 5 2015.07.05 415
## 6 2015.07.06 378
## 7 2015.07.07 379
## 8 2015.07.08 378
## 9 2015.07.09 321
## 10 2015.07.10 427
## # … with 793 more rows
Now we can transform the date to a date type.
nested <- nested %>%
mutate(data = map(.x = data,
.f = ~ mutate(.x, date = as.POSIXct(date, format = "%Y.%m.%d"))))
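The format string "%Y.%m.%d" tells as.POSIXct how to parse strings like "2015.07.01"; a quick standalone check (the tz argument is added here just to make the result reproducible):

```r
# parse a dot-separated date string into a POSIXct value
d <- as.POSIXct("2015.07.01", format = "%Y.%m.%d", tz = "UTC")
format(d, "%Y-%m-%d")  # "2015-07-01"
```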
We have everything as we wanted, so we can delete our original dataframe to free some memory.
head(nested %>% head(1) %>% .$data)
## [[1]]
## # A tibble: 803 x 2
## date views
## <dttm> <dbl>
## 1 2015-07-01 00:00:00 463
## 2 2015-07-02 00:00:00 435
## 3 2015-07-03 00:00:00 474
## 4 2015-07-04 00:00:00 444
## 5 2015-07-05 00:00:00 415
## 6 2015-07-06 00:00:00 378
## 7 2015-07-07 00:00:00 379
## 8 2015-07-08 00:00:00 378
## 9 2015-07-09 00:00:00 321
## 10 2015-07-10 00:00:00 427
## # … with 793 more rows
rm(data)
The point of this notebook is to show how to use modeling for data exploration. But even so - I can’t jump into modeling without knowing what the data looks like at the least.
I’m most interested in identifying structure in the data that will influence my choices of parameters in models to come, or which models to use. For example, I’d be interested in identifying any trend the pages could have. I’d also be interested in identifying any outliers or extreme values. This may motivate me to transform my data in some way, or use models that are robust to extreme values.
Some interesting questions are:
nested %>%
group_by(language) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n) * 100) %>%
filter(freq > 1) %>%
drop_na() %>%
ggplot(aes(x = reorder(language, -freq), y = freq, fill = language)) +
geom_col() +
xlab("language") +
ylab("Proportion of pages (%)") +
theme(legend.position = "")
Using this extraction method, we can see the languages that account for more than 1% of total pages (dropping NAs).
The remaining languages can be lumped into an "other" category, which will make our lives easier.
nested <- nested %>%
ungroup() %>%
mutate(language = ifelse(
language %in% c("de", "en", "fr", "ru", "zh", NA), language, "other"))
nested %>%
group_by(domain) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n) * 100) %>%
drop_na() %>%
ggplot(aes(x = reorder(domain, -freq), y = freq, fill = domain)) +
geom_col() +
xlab("domain") +
ylab("Proportion of pages (%)") +
theme(legend.position = "")
This seems to be working well. Now we can compare articles across different languages.
If there is a clear trend in the data, it is important to identify it early. Let's look at the first time series in the nested dataframe as an example. Using a log scale for the y-axis mitigates the effect of large spikes and lets us see more structure.
# isolate the first page
first_page <- head(nested, 1)
# plot the series of the first page
first_page %>%
.$data %>%
.[[1]] %>%
ggplot(aes(x = date, y = views, color = views)) +
geom_point() +
geom_line() +
scale_y_log10()
# a function to plot a simple time series plot on a log scale for the y-axis
plot_series_log <- function(df, title = NULL){
df %>%
ggplot(aes(x = date, y = views, color = views)) +
geom_point() +
geom_line() +
labs(title = paste(title, "series", sep = " ")) +
scale_y_log10() +
theme(legend.position = "")
}
first_page %>%
.$data %>%
.[[1]] %>%
plot_series_log(title = first_page$article)
The trend tells us something about this article, but what about the rest of them? How can they be compared?
Let's try to compare them and find a more relevant page. To get a feel for how popular each page is, I'll store the average, median, and standard deviation of the daily views for each page in the nested dataframe.
To do so without unnesting the data in nested, I'll use a higher-order function, extract_metric. This function takes in a nested dataframe d and a function metric used to calculate a metric, and returns that function applied to the views column of the nested dataset. I can then map this function with metric equal to mean, median, and function(l) sqrt(var(l)) to get the average, median, and standard deviation of the views in each nested dataframe.
# apply an arbitrary metric to the `views` column of the nested data
extract_metric <- function(d, metric, ...){
metric(d$views, ...)
}
# map this H.O.F to get the average, median and standard deviation of views
nested <- nested %>%
mutate(average.views = map_dbl(.f = extract_metric,
.x = data, metric = mean, na.rm = TRUE),
median.views = map_dbl(.f = extract_metric, .x = data,
metric = median, na.rm = TRUE),
stddev.views = map_dbl(.f = extract_metric, .x = data,
metric = function(l) sqrt(var(l, na.rm = TRUE))))
Now we have the mean, median and standard deviation of the views stored in nested.
head(nested)
## # A tibble: 6 x 7
## article language domain data average.views median.views stddev.views
## <chr> <chr> <chr> <list> <dbl> <dbl> <dbl>
## 1 攻殼機動隊… zh wikipe… <tibb… 1178. 515 2524.
## 2 海尔·塞拉西一世… zh wikipe… <tibb… 90.7 58 835.
## 3 Gaten_Ma… en wikipe… <tibb… 4862. 3214. 5138.
## 4 Сифилис ru wikipe… <tibb… 1359. 1337 375.
## 5 Superman en wikipe… <tibb… 12581. 11092 5562.
## 6 Panzerka… de wikipe… <tibb… 11.8 7 54.7
The mean, or average, is the expected value of a variable. (The density plots below are smoothed histograms.)
nested %>%
ggplot(aes(x = average.views, fill = language)) +
geom_density(position = "stack") +
scale_x_log10() +
xlab("Average daily views (log scale)")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing non-finite values (stat_density).
This means that English pages receive more daily views, on average, than pages in the other languages.
nested %>%
ggplot(aes(x = median.views, fill = language)) +
geom_density(position = "stack") +
scale_x_log10() +
xlab("Median daily views (log scale)")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 245 rows containing non-finite values (stat_density).
The median is the value such that a randomly chosen observation has probability at least 1/2 (50%) of being less than or equal to it, and probability at least 1/2 of being greater than or equal to it.
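A small illustration of why the median is often preferred for spiky data like page views: it is robust to extreme values, while the mean is not.

```r
x <- c(1, 2, 2, 3, 100)
median(x)  # 2 - unaffected by the extreme value
mean(x)    # 21.6 - pulled up by the single outlier
```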
nested %>%
ggplot(aes(x = stddev.views, fill = language)) +
geom_density(position = "stack", alpha = 0.5) +
scale_x_log10() +
xlab("Standard deviation of daily views (log scale)")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 6 rows containing non-finite values (stat_density).
The standard deviation measures how spread out the values in a set are.
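Note that the function(l) sqrt(var(l)) used above is equivalent to base R's sd() (up to floating-point precision):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sqrt(var(x)) == sd(x)  # TRUE
```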
Now, arranging the pages in descending order of average view count:
nested %>%
arrange(desc(average.views))
## # A tibble: 28,000 x 7
## article language domain data average.views median.views stddev.views
## <chr> <chr> <chr> <list> <dbl> <dbl> <dbl>
## 1 Main_Page en wikip… <tibb… 37739050. 32965536 16027202.
## 2 Special:… en wikip… <tibb… 4341363. 3968131 2112402.
## 3 Wikipédi… fr wikip… <tibb… 1848044. 1936594 402801.
## 4 Special:… en wikip… <tibb… 573053. 467976 344495.
## 5 Заглавна… ru wikip… <tibb… 509975. 501972 153631.
## 6 404.php en wikip… <tibb… 395719. 1034 1108375.
## 7 Special:… en wikip… <tibb… 376041. 293257 340435.
## 8 Sp?cial:… fr wikip… <tibb… 349290. 369822 154725.
## 9 Spécial:… fr wikip… <tibb… 329868. 331357 68528.
## 10 Special:… <NA> wikim… <tibb… 285204. 212132. 295575.
## # … with 27,990 more rows
Now for the fun stuff - mapping models onto the nested data. As a first step, I will try and model the trend of the view count using simple linear regression. Then, looking at measures of model quality such as the \(R^2\), I can see which series are well explained with a linear trend, and which have more complex changes in mean.
To apply a linear model to each of the nested dataframes, I'll first define a function that takes in a dataframe and fits simple linear regression to it. Then, mapping this function onto each nested dataframe, we get a new column, linear_trend, which stores the linear model fit to each corresponding nested dataframe:
# a function for fitting SLR to an input dataframe
apply_lm <- function(df){
lm(data = df, views ~ date)
}
# fit a linear model to each page
nested <- nested %>%
mutate(linear_trend = map(.f = apply_lm, .x = data))
# isolate the first page
first_page <- head(nested, 1)
Now, along with a list column storing the data for each page, we also have a fitted linear model object stored in a separate column for each Wikipedia page:
nested %>%
head() %>%
select(article, data, linear_trend)
## # A tibble: 6 x 3
## article data linear_trend
## <chr> <list> <list>
## 1 攻殼機動隊 <tibble [803 × 2]> <lm>
## 2 海尔·塞拉西一世 <tibble [803 × 2]> <lm>
## 3 Gaten_Matarazzo <tibble [803 × 2]> <lm>
## 4 Сифилис <tibble [803 × 2]> <lm>
## 5 Superman <tibble [803 × 2]> <lm>
## 6 Panzerkampfwagen_IV <tibble [803 × 2]> <lm>
For example, if we wanted to see the summary of the first linear model fit:
nested %>% head(1) %>% .$linear_trend
## [[1]]
##
## Call:
## lm(formula = views ~ date, data = df)
##
## Coefficients:
## (Intercept) date
## -4.638e+04 3.235e-05
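The date coefficient looks tiny (3.235e-05) because POSIXct is stored as seconds since 1970-01-01, so the slope is in views per second; multiply by 86400 to get views per day (here, 3.235e-05 × 86400 ≈ 2.8 views per day). A toy example with made-up data makes the units concrete:

```r
# a series that grows by exactly one view per second (86400 views per day)
d <- data.frame(t = as.POSIXct(c("2020-01-01", "2020-01-02"), tz = "UTC"),
                y = c(0, 86400))
coef(lm(y ~ t, data = d))[["t"]]  # 1 (views per second)
```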
It’d be interesting to store a measure of model quality for each of these linear models - namely the \(R^2\) statistic. This will be helpful, as looking at each model’s \(R^2\) will help us highlight which Wikipedia pages exhibit clear linear trend, and which don’t (this might be hard to determine otherwise - can you think of a good way to do so?)
I'll define a function extract_r2, which uses broom's glance() function to extract the \(R^2\) of a linear model. I'll then map this function onto the nested lm models to store the \(R^2\) for each model:
# a function for extracting only the R-squared statistics
extract_r2 <- function(model){
glance(model)$r.squared
}
# map this function onto each model to store the R^2
nested <- nested %>%
mutate(lm.r.squared = purrr::map_dbl(.f = extract_r2, .x = linear_trend))
Looking at the distribution of \(R^2\) across the different Wikipedia pages:
nested %>%
ggplot(aes(x = lm.r.squared)) +
geom_density()
## Warning: Removed 1 rows containing non-finite values (stat_density).
Most of the time series cannot be explained well by a linear model, leading to low \(R^2\).
Some models have suspiciously high \(R^2\) values - I suspect this is because most of the data is missing, and thus a linear model can fit these sparse data more effectively. To test this hypothesis, I can plot the model with the highest \(R^2\):
# a function for plotting a time series, with a fitted linear trend line on top of it
plot_linear_trend <- function(df, title){
df %>%
ggplot(aes(x = date, y = views, color = views)) +
geom_point() +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = title) +
theme(legend.position = "")
}
nested %>%
arrange(desc(lm.r.squared)) %>%
head(1) %>%
mutate(chart = map2(.f = plot_linear_trend, .x = data, .y = article)) %>%
.$chart
## [[1]]
## Warning: Removed 801 rows containing non-finite values (stat_smooth).
## Warning: Removed 801 rows containing missing values (geom_point).
## Warning: Removed 801 rows containing missing values (geom_path).
Indeed - the model with the highest \(R^2\) has only 2 non-missing points - which can be fitted perfectly by a line.
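This is easy to confirm in isolation: a straight line passes exactly through any two points, so with only two non-missing observations \(R^2\) is 1 regardless of the values (a toy check):

```r
two_points <- data.frame(x = c(1, 5), y = c(2, 10))
summary(lm(y ~ x, data = two_points))$r.squared  # exactly 1
```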
But if we skip past the pages with the 50 highest \(R^2\) values, we can see models that genuinely have a roughly linear trend:
nested %>%
arrange(desc(lm.r.squared)) %>%
filter(dplyr::between(row_number(), 50, 55)) %>%
mutate(chart = purrr::map2(.f = plot_linear_trend,
.x = data, .y = article)) %>%
.$chart
## [[1]]
## Warning: Removed 380 rows containing non-finite values (stat_smooth).
## Warning: Removed 380 rows containing missing values (geom_point).
## Warning: Removed 378 rows containing missing values (geom_path).
##
## [[2]]
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
##
## [[3]]
## Warning: Removed 327 rows containing non-finite values (stat_smooth).
## Warning: Removed 327 rows containing missing values (geom_point).
## Warning: Removed 327 rows containing missing values (geom_path).
##
## [[4]]
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
##
## [[5]]
## Warning: Removed 330 rows containing non-finite values (stat_smooth).
## Warning: Removed 330 rows containing missing values (geom_point).
## Warning: Removed 330 rows containing missing values (geom_path).
##
## [[6]]
Now, looking at the model with the lowest \(R^2\):
nested %>%
arrange(lm.r.squared) %>%
.[1,] %>%
mutate(chart = purrr::map2(.f = plot_linear_trend,
.x = data, .y = article)) %>%
.$chart
## [[1]]
## Warning: Removed 802 rows containing non-finite values (stat_smooth).
## Warning: Removed 802 rows containing missing values (geom_point).
## Warning: Removed 802 rows containing missing values (geom_path).
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
Now, looking at plots of series which don’t have extraordinarily high or low \(R^2\) (to avoid series with mostly missing values), we find some series that truly exhibit non-linear trends, resulting in low \(R^2\):
nested %>%
arrange(desc(lm.r.squared)) %>%
filter(dplyr::between(row_number(), 1000, 1005)) %>%
mutate(chart = purrr::map2(.f = plot_linear_trend,
.x = data, .y = article)) %>%
.$chart
## [[1]]
##
## [[2]]
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 38 rows containing missing values (geom_path).
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]